A decade or more ago I read a nice worked example from the political scientist Simon Jackman demonstrating how to do Principal Components Analysis, one of the basic techniques for reducing high-dimensional data (i.e. data with many variables or features) to some much smaller set of dimensions that nevertheless represents or condenses the information in the data in some useful way. In PCA, we transform the data in order to find the “best” set of underlying components. We want the dimensions we choose to be orthogonal to one another—that is, linearly uncorrelated. An interactive introduction to some of the intuitions behind PCA can be found at setosa.io (written by Victor Powell and Lewis Lehe). Next take a look at this related discussion by Matt Brems. And finally here’s a tutorial by Lindsay Smith (PDF) that covers a little more of the algebra while working through an example.
Like much of the toolkit of linear algebra, PCA has many applications. In the context of data analysis it can be thought of as an inductive method, where a lot of the interpretation of the components is left up to the researcher. Because of the way it works, we’re arithmetically guaranteed to find a set of components that “explain” all the variance we observe. The substantive explanatory question is whether the main components uncovered by PCA have a plausible interpretation.
The second half of this discussion has an example using some high-dimensional social science data or, to put it rather more mundanely, a bunch of different measures of characteristics of counties in the Midwestern United States. But to begin with let’s motivate PCA in a different way. Another way of saying “PCA is a technique for high-dimensional data reduction” is to say “PCA is a technique for summarizing data with many details into a smaller amount of information that retains the general gist of the original”. Another way of saying that is that PCA is a technique for compressing data down to a few key elements — or, well, a few principal components.
A nice thing about Jackman’s discussion was that he did PCA on an image, in order to show both how you could reconstruct the whole image from the PCA and, more importantly, to provide some intuition about what the first few components of a PCA picked up on. His discussion doesn’t seem to be available anymore, so I rewrote the example myself. I’ll use the same image he did. This one:
Elvis meets Nixon
We set up our tidyverse toolkit as usual.
# install.packages(c("tidyverse", "here", "broom"), repos = "http://cran.rstudio.com")
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x purrr::is_null() masks testthat::is_null()
## x dplyr::lag() masks stats::lag()
## x dplyr::matches() masks tidyr::matches(), testthat::matches()
library(here)
## here() starts at /Users/kjhealy/Documents/data/misc/mfa_pca
library(broom)
For managing the image, the imager library is our friend here. It’s a great toolkit for processing images in R, and it’s friendly to tidyverse packages, too.
# install.packages("imager", repos = "http://cran.rstudio.com")
library(imager)
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
## The following objects are masked from 'package:testthat':
##
## equals, is_less_than, not
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:stringr':
##
## boundary
## The following object is masked from 'package:tidyr':
##
## fill
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:grid':
##
## depth
## The following object is masked from 'package:base':
##
## save.image
Our image is in the img/ subfolder of our project directory. The load.image() function is from Imager. It imports the image as a cimg object. The library provides a method to convert these objects to a long-form data frame. Our image is grayscale, which makes it easier to work with. It’s 800 pixels wide by 633 pixels tall.
img <- load.image(here("img/elvis-nixon.jpeg"))
str(img)
## 'cimg' num [1:800, 1:633, 1, 1] 0.914 0.929 0.91 0.906 0.898 ...
dim(img)
## [1] 800 633 1 1
img_df_long <- as.data.frame(img)
head(img_df_long)
## x y value
## 1 1 1 0.9137255
## 2 2 1 0.9294118
## 3 3 1 0.9098039
## 4 4 1 0.9058824
## 5 5 1 0.8980392
## 6 6 1 0.8862745
Each x-y pair is a location in the 800 by 633 pixel grid, and the value is a grayscale value ranging from zero to one. To do a PCA we will need a matrix of data in wide format, though—one that reproduces the shape of the image (i.e. a rectangle), just as a numerical matrix. We’ll widen it using pivot_wider():
img_df <- tidyr::pivot_wider(img_df_long,
names_from = y,
values_from = value)
dim(img_df)
## [1] 800 634
## Look at first five rows and first five columns
img_df[1:5, 1:5]
## # A tibble: 5 x 5
## x `1` `2` `3` `4`
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.914 0.914 0.914 0.910
## 2 2 0.929 0.929 0.925 0.918
## 3 3 0.910 0.910 0.902 0.894
## 4 4 0.906 0.902 0.898 0.894
## 5 5 0.898 0.894 0.890 0.886
Notice the x column there, which just names each of the rows. It’s not part of the matrix as such, so we’ll drop it before doing any matrix calculations.
# Image data
tmp <- img_df %>% select(-x)
dim(tmp)
## [1] 800 633
tmp[1:5, 1:5]
## # A tibble: 5 x 5
## `1` `2` `3` `4` `5`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.914 0.914 0.914 0.910 0.902
## 2 0.929 0.929 0.925 0.918 0.910
## 3 0.910 0.910 0.902 0.894 0.886
## 4 0.906 0.902 0.898 0.894 0.890
## 5 0.898 0.894 0.890 0.886 0.882
# Scaled and centered
tmp_norm <- scale(tmp, center = TRUE, scale = TRUE)
tmp_norm[1:5, 1:5]
## 1 2 3 4 5
## [1,] 0.3062542 0.3271632 0.3352950 0.3137665 0.2662779
## [2,] 0.4172912 0.4326572 0.4124186 0.3645892 0.3165605
## [3,] 0.2784949 0.3007897 0.2581714 0.2121212 0.1657127
## [4,] 0.2507356 0.2480427 0.2324635 0.2121212 0.1908540
## [5,] 0.1952171 0.1952957 0.1810477 0.1612985 0.1405714
# Covariance matrix
cov_mat <- cov(tmp_norm)
dim(cov_mat)
## [1] 633 633
cov_mat[1:5, 1:5]
## 1 2 3 4 5
## 1 1.0000000 0.9945150 0.9877773 0.9848553 0.9820028
## 2 0.9945150 1.0000000 0.9966495 0.9902557 0.9834383
## 3 0.9877773 0.9966495 1.0000000 0.9958584 0.9866195
## 4 0.9848553 0.9902557 0.9958584 1.0000000 0.9931752
## 5 0.9820028 0.9834383 0.9866195 0.9931752 1.0000000
# Decomposition/Factorization into
# Eigenvalues and eigenvectors
cov_eig <- eigen(cov_mat)
names(cov_eig)
## [1] "values" "vectors"
# Eigenvalues (variances)
cov_evals <- cov_eig$values
cov_evals[1:5]
## [1] 231.41622 120.60984 56.89806 31.05158 22.82532
# Eigenvectors (principal components)
cov_evecs <- cov_eig$vectors
cov_evecs[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.006164854 -0.06573463 0.02875530 -0.03933564 0.06010992
## [2,] -0.006734353 -0.06614096 0.02858038 -0.03849782 0.05901673
## [3,] -0.007154980 -0.06590388 0.02736392 -0.03892858 0.05852467
## [4,] -0.007471463 -0.06604222 0.02671015 -0.03827302 0.05944892
## [5,] -0.006484754 -0.06614490 0.02788219 -0.03994717 0.06062967
# Rotated data -- i.e. the coordinates of the
# original data points translated into the
# transformed coordinate space. Equivalent to prcomp()'s x
tmp_rot <- tmp_norm %*% cov_evecs
dim(tmp_rot)
## [1] 800 633
tmp_rot[1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 11.589359 -4.479811 -2.268584 3.666006 4.655793
## [2,] 10.917918 -4.386686 -3.477785 4.237974 6.417180
## [3,] 9.411595 -4.239953 -4.511044 3.983355 6.585201
## [4,] 8.768736 -3.946033 -4.653053 3.398185 4.503424
## [5,] 8.452248 -3.153785 -4.491075 2.568274 1.609303
# Should be zero
round(cov(cov_evecs), 2)[1:5,1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
We don’t need to do the PCA manually, of course. Instead, we drop the x column that identifies the rows and feed the 800x633 matrix straight to R’s prcomp() function.
img_pca <- img_df %>%
dplyr::select(-x) %>%
prcomp(scale = TRUE, center = TRUE)
We can tidy the output of prcomp with the broom package’s tidy function. We’ll do that to get a summary scree plot showing the variance “explained” by each component.
pca_tidy <- tidy(img_pca, matrix = "pcs")
## What it looks like
pca_tidy
## # A tibble: 633 x 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 15.2 0.366 0.366
## 2 2 11.0 0.191 0.556
## 3 3 7.54 0.0899 0.646
## 4 4 5.57 0.0491 0.695
## 5 5 4.78 0.0361 0.731
## 6 6 4.56 0.0328 0.764
## 7 7 4.06 0.0261 0.790
## 8 8 3.66 0.0212 0.811
## 9 9 3.37 0.0179 0.829
## 10 10 3.28 0.0170 0.846
## # … with 623 more rows
pca_tidy %>%
ggplot(aes(x = PC, y = percent)) +
geom_line() +
scale_y_continuous(labels = scales::percent) +
labs(x = "Principal Component", y = "Percent Variance Explained")
Looks like a lot of the “information” in the image is contained in just the first few principal components.
Now comes the fun bit. The object produced by prcomp() has a few pieces inside:
names(img_pca)
## [1] "sdev" "rotation" "center" "scale" "x"
What are these? sdev contains the standard deviations of the principal components. rotation is a matrix where the rows correspond to the columns of the original data, and the columns are the principal components. x is the matrix of rotated data: the centered, scaled original data multiplied by the rotation matrix. Finally, center and scale are vectors containing the centering and scaling values for each variable.
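To fix intuitions about how these pieces fit together, here’s a quick sanity check on a small made-up matrix (standing in for the image data): the x matrix really is just the centered, scaled data multiplied by the rotation matrix.

```r
set.seed(100)
## A small synthetic matrix standing in for the image data:
## 10 observations (rows) and 3 variables (columns)
dat <- matrix(rnorm(30), nrow = 10, ncol = 3)
pca <- prcomp(dat, center = TRUE, scale = TRUE)

## Recompute x by hand: center and scale the data, then rotate it
scores_by_hand <- scale(dat, center = TRUE, scale = TRUE) %*% pca$rotation

## Identical (up to floating point) to what prcomp() stores in x
all.equal(unname(pca$x), unname(scores_by_hand))
```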
Now we’re going to run the PCA backwards. Reversing all of our steps perfectly reconstitutes the information in the original image: we multiply x by the transpose of the rotation matrix, then undo the scaling, then undo the centering. If we use the full rotation matrix, we recover the original data matrix exactly. This is just a matter of running the matrix multiplication and so on in the reverse direction.
But, thanks to the way matrix multiplication works, we can also choose to use just the first few principal components from our PCA when doing the reversal. There are 633 components in all (corresponding to the number of rows in the original data matrix). As we saw, though, the scree plot suggests that most of the data is “explained” by a much smaller number of components than that.
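Before writing the function, here’s the reversal logic in miniature on a made-up matrix (not the image): rotating back and undoing the scaling and centering recovers the original data exactly, while truncating the rotation matrix to its first few columns gives an approximation instead.

```r
set.seed(101)
## A made-up data matrix: 10 observations, 5 variables
dat <- matrix(rnorm(50), nrow = 10, ncol = 5)
pca <- prcomp(dat, center = TRUE, scale = TRUE)

## Full reversal: rotate back, then undo the scaling, then the centering
recon <- pca$x %*% t(pca$rotation)
recon <- sweep(recon, 2, pca$scale, `*`)   # un-scale each column
recon <- sweep(recon, 2, pca$center, `+`)  # un-center each column
all.equal(unname(dat), unname(recon))      # exact recovery

## Using only the first two components gives an approximation instead
approx2 <- pca$x[, 1:2] %*% t(pca$rotation[, 1:2])
```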
Here’s a function that takes a PCA object created by prcomp() and returns an approximation of the original data, calculated by some number (n_comp) of principal components. It returns its results in long format, in a way that mirrors what the Imager library wants. This will make plotting easier in a minute.
reverse_pca <- function(n_comp = 20, pca_object = img_pca){
## The pca_object is an object created by base R's prcomp() function.
## Multiply the matrix of rotated data by the transpose of the matrix
## of eigenvectors (i.e. the component loadings) to get back to a
## matrix of original data values. But don't use the full matrix, use
## some subset of Principal Components defined by n_comp. By default,
## components 1 to 20. n_comp can be any number from 1 to the number
## of components in the pca_object.
recon <- pca_object$x[, 1:n_comp] %*% t(pca_object$rotation[, 1:n_comp])
## Reverse any scaling and centering that was done by prcomp()
if(all(pca_object$scale != FALSE)){
## Rescale by the reciprocal of the scaling factor, i.e. back to
## original range.
recon <- scale(recon, center = FALSE, scale = 1/pca_object$scale)
}
if(all(pca_object$center != FALSE)){
## Remove any mean centering by adding the subtracted mean back in
recon <- scale(recon, scale = FALSE, center = -1 * pca_object$center)
}
## Make recon a data frame that we can easily pivot to long format
## (because that's the format that the excellent imager library wants
## when drawing image plots with ggplot)
recon_df <- data.frame(cbind(1:nrow(recon), recon))
colnames(recon_df) <- c("x", 1:(ncol(recon_df)-1))
## Return the data to long form
recon_df_long <- recon_df %>%
tidyr::pivot_longer(cols = -x,
names_to = "y",
values_to = "value") %>%
mutate(y = as.numeric(y)) %>%
arrange(y) %>%
as.data.frame()
## Return the result to the user
recon_df_long
}
Let’s put the function to work by mapping it to our PCA object, and reconstructing our image based on the first 2, 3, 4, 5, 10, 20, 50, and 100 principal components.
## The sequence of PCA components we want
n_pcs <- c(2:5, 10, 20, 50, 100)
names(n_pcs) <- paste("First", n_pcs, "Components", sep = "_")
## Map reverse_pca()
recovered_imgs <- map_dfr(n_pcs,
reverse_pca,
.id = "pcs") %>%
mutate(pcs = stringr::str_replace_all(pcs, "_", " "),
pcs = factor(pcs, levels = unique(pcs), ordered = TRUE))
This gives us a very long tibble with an index (pcs) for the number of components used to reconstruct the image. In essence it’s eight images stacked on top of one another, with each image being reconstituted using a larger number of components than before. Now we can plot each image in a small multiple.
p <- ggplot(data = recovered_imgs,
mapping = aes(x = x, y = y, fill = value))
p_out <- p + geom_raster() +
scale_y_reverse() +
scale_fill_gradient(low = "black", high = "white") +
facet_wrap(~ pcs, ncol = 2) +
guides(fill = FALSE) +
labs(title = "Recovering the content of an 800x633 pixel image\nfrom a Principal Components Analysis of its pixels") +
theme(strip.text = element_text(face = "bold", size = rel(1.2)),
plot.title = element_text(size = rel(1.5)))
p_out
ggsave(here("figures/elvis-pca.png"), p_out, height = 16, width = 8)
There’s a lot more one could do with this, especially if I knew rather more linear algebra than I in fact do. But at any rate, we can see that it’s pretty straightforward to use R to play around with PCA and images in a tidy framework.
Let’s use the midwest data that we’ve seen once or twice before to apply PCA in a data analysis context. Remember what it looks like:
midwest
## # A tibble: 437 x 28
## PID county state area poptotal popdensity popwhite popblack popamerindian
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
## 2 562 ALEXA… IL 0.014 10626 759 7054 3496 19
## 3 563 BOND IL 0.022 14991 681. 14477 429 35
## 4 564 BOONE IL 0.017 30806 1812. 29344 127 46
## 5 565 BROWN IL 0.018 5836 324. 5264 547 14
## 6 566 BUREAU IL 0.05 35688 714. 35157 50 65
## 7 567 CALHO… IL 0.017 5322 313. 5298 1 8
## 8 568 CARRO… IL 0.027 16805 622. 16519 111 30
## 9 569 CASS IL 0.024 13437 560. 13384 16 8
## 10 570 CHAMP… IL 0.058 173025 2983. 146506 16559 331
## # … with 427 more rows, and 19 more variables: popasian <int>, popother <int>,
## # percwhite <dbl>, percblack <dbl>, percamerindan <dbl>, percasian <dbl>,
## # percother <dbl>, popadults <int>, perchsd <dbl>, percollege <dbl>,
## # percprof <dbl>, poppovertyknown <int>, percpovertyknown <dbl>,
## # percbelowpoverty <dbl>, percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
These are US counties from Midwestern states, and we have a whole bunch of numeric measures from the Census. Let’s group them by state and then select just the numeric measures:
mw_pca <- midwest %>%
group_by(state) %>%
select_if(is.numeric) %>%
select(-PID)
mw_pca
## # A tibble: 437 x 25
## # Groups: state [5]
## state area poptotal popdensity popwhite popblack popamerindian popasian
## <chr> <dbl> <int> <dbl> <int> <int> <int> <int>
## 1 IL 0.052 66090 1271. 63917 1702 98 249
## 2 IL 0.014 10626 759 7054 3496 19 48
## 3 IL 0.022 14991 681. 14477 429 35 16
## 4 IL 0.017 30806 1812. 29344 127 46 150
## 5 IL 0.018 5836 324. 5264 547 14 5
## 6 IL 0.05 35688 714. 35157 50 65 195
## 7 IL 0.017 5322 313. 5298 1 8 15
## 8 IL 0.027 16805 622. 16519 111 30 61
## 9 IL 0.024 13437 560. 13384 16 8 23
## 10 IL 0.058 173025 2983. 146506 16559 331 8033
## # … with 427 more rows, and 17 more variables: popother <int>, percwhite <dbl>,
## # percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>
Note the use of select_if() there. We also drop PID because although it’s numeric, it’s a case identifier, not a measured variable.
Now let’s write a PCA helper function that’s specific to the data we’re working with. It takes some data, df, and then does the thing we want to that data—in this case, fit a PCA using the prcomp() function.
do_pca <- function(df){
prcomp(df,
center = TRUE, scale = TRUE)
}
The center and scale arguments are passed on to prcomp(). PCA results are sensitive to the scales on which the variables are measured, so it is conventional to center them (by subtracting the mean of each one) and scale them (by dividing each by its standard deviation). This makes the resulting values more directly comparable.
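To make the point concrete, here’s a sketch on made-up data (not midwest) showing why scaling matters: a variable measured in large units dominates the first component of an unscaled PCA, while after scaling both variables contribute equally.

```r
set.seed(42)
## Two made-up, unrelated variables on wildly different scales
df <- data.frame(income = rnorm(100, mean = 50000, sd = 10000),
                 rate   = rnorm(100, mean = 0.5, sd = 0.1))

## Unscaled: the first component is essentially just "income",
## because its raw variance swamps everything else
unscaled <- prcomp(df, center = TRUE, scale = FALSE)
unscaled$rotation[, "PC1"]

## Scaled: both variables load equally (magnitude 1/sqrt(2)) on PC1
scaled <- prcomp(df, center = TRUE, scale = TRUE)
scaled$rotation[, "PC1"]
```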
As before we could do a PCA on the whole dataset:
out_pca <- mw_pca %>%
ungroup() %>%
select(-state) %>%
do_pca()
If you print the results of a PCA analysis to the console, you will see a square table of numbers. The rows of this table will have the same names as the columns from our original data, i.e., the variables. The columns are the orthogonal principal components, named PC1 to PC24 in this case. (Each column is an eigenvector.) There will be as many components as variables. Ideally, the first few components will “explain” most of the variation in the data, and the way that the original variables are associated with each component will have some sort of substantively plausible interpretation.
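We can verify those structural facts directly on a small made-up matrix (not the midwest data): there are as many components as variables, and the columns of the rotation matrix are orthonormal eigenvectors.

```r
set.seed(7)
## A made-up matrix with 4 variables
dat <- matrix(rnorm(200), nrow = 50, ncol = 4)
pca <- prcomp(dat, center = TRUE, scale = TRUE)

## As many components as variables...
ncol(pca$rotation) == ncol(dat)

## ...and the columns (the eigenvectors) are orthonormal:
## their crossproduct is the identity matrix
round(crossprod(pca$rotation), 10)
```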
Here’s a peek at the components for the whole dataset:
out_pca
## Standard deviations (1, .., p=24):
## [1] 3.098601e+00 2.209601e+00 1.649472e+00 1.192893e+00 1.121586e+00
## [6] 8.977640e-01 8.859441e-01 8.194829e-01 6.921234e-01 5.649742e-01
## [11] 5.439431e-01 4.854143e-01 3.799956e-01 3.583348e-01 3.094795e-01
## [16] 2.500930e-01 2.087861e-01 1.924391e-01 9.654481e-02 3.473006e-02
## [21] 1.328238e-02 3.862376e-03 2.886151e-09 5.193302e-16
##
## Rotation (n x k) = (24 x 24):
## PC1 PC2 PC3 PC4
## area -0.026194151 0.014996226 -0.103615981 0.25080525
## poptotal -0.312990340 0.051517565 0.110772033 0.05437383
## popdensity -0.292047651 0.023597548 0.045576790 -0.10421671
## popwhite -0.314102501 0.023751463 0.081172077 0.02779283
## popblack -0.287733756 0.107425973 0.149017651 0.05977198
## popamerindian -0.269188348 0.091620270 -0.012129451 -0.12363778
## popasian -0.284059460 0.057769639 0.145218465 0.21322836
## popother -0.258309265 0.080662808 0.196512506 0.21628387
## percwhite 0.182514833 -0.175573945 0.254864514 0.43714304
## percblack -0.198800629 0.111941861 -0.153186027 -0.25421194
## percamerindan -0.001534117 0.174540376 -0.184268646 -0.40652737
## percasian -0.191218173 -0.127231985 -0.332956738 0.20775316
## percother -0.173671168 -0.050556657 0.030173428 -0.09437040
## popadults -0.312119508 0.052857188 0.116377419 0.05450157
## perchsd -0.094221446 -0.355863355 -0.181247085 -0.04333187
## percollege -0.153071342 -0.255449911 -0.344841584 0.08121518
## percprof -0.143077470 -0.206478282 -0.392783145 0.16013389
## poppovertyknown -0.312587935 0.052236356 0.114220833 0.05273857
## percpovertyknown 0.016028294 0.007402401 0.359748016 -0.34952458
## percbelowpoverty 0.021855622 0.402909415 -0.237353423 0.08207085
## percchildbelowpovert 0.009476823 0.412110822 -0.150115286 -0.04124745
## percadultpoverty 0.014088458 0.358988477 -0.309021903 0.15457501
## percelderlypoverty 0.085358935 0.368070961 0.007658326 0.18553385
## inmetro -0.137665882 -0.192436724 -0.125070989 -0.33100508
## PC5 PC6 PC7 PC8
## area -0.6268042152 0.486893708 -0.453569767 0.1079319668
## poptotal 0.0024970161 0.040578465 0.048861233 -0.0285013319
## popdensity 0.1390525473 0.151857172 0.069362791 -0.0664995961
## popwhite 0.0163175123 0.072589923 0.056167495 -0.0008575316
## popblack 0.0061929483 0.048796558 0.016831570 -0.1220243486
## popamerindian -0.1960220830 0.141355661 0.052576144 -0.1347051621
## popasian -0.0833969809 -0.162859041 0.128790221 0.0472430208
## popother -0.1104914802 -0.262246435 0.039072780 0.0526606531
## percwhite 0.0823408813 0.075888521 0.129995786 0.1597830701
## percblack 0.3439790573 0.252961702 -0.383051501 -0.1771722325
## percamerindan -0.5220486545 -0.288345089 0.324526942 -0.1461253649
## percasian 0.0575491870 -0.063473221 0.116589841 0.1804793473
## percother -0.0218809886 -0.583801828 -0.595802991 0.3901700486
## popadults 0.0007915297 0.041900620 0.052095520 -0.0298336006
## perchsd -0.1857555151 0.036568047 0.041004313 0.0396282112
## percollege -0.0243188162 0.082166160 0.122560453 0.1811713006
## percprof 0.1159894026 -0.039804454 0.193789766 0.1010791912
## poppovertyknown 0.0015598333 0.041736565 0.049619713 -0.0273696954
## percpovertyknown -0.0787444895 0.257253963 0.237599189 0.6826519171
## percbelowpoverty 0.0444121328 -0.003257345 0.053924973 0.1821190681
## percchildbelowpovert 0.0512103005 0.071824098 -0.005100555 0.2244817257
## percadultpoverty 0.0356366301 -0.037897304 0.076111737 0.1793070629
## percelderlypoverty 0.1419281187 0.082819310 0.024862129 0.0742401752
## inmetro 0.2183862033 0.157288149 -0.003008790 0.2202521313
## PC9 PC10 PC11 PC12
## area 0.12663712 -0.094179408 0.11231535 -0.04313244
## poptotal 0.01552267 0.008775679 -0.01791350 -0.01501762
## popdensity -0.13895834 0.342485375 0.17787235 0.05949232
## popwhite 0.01271539 0.052449201 0.06560517 -0.03436697
## popblack -0.02692551 -0.004457095 -0.17937034 0.06297394
## popamerindian -0.04109829 0.449228928 0.10953017 0.37974128
## popasian 0.10107752 -0.237369140 -0.02627899 -0.13675147
## popother 0.17717451 -0.305645557 -0.22440673 -0.08627011
## percwhite 0.13213941 0.284848619 -0.05544924 0.09470116
## percblack -0.21602790 -0.307159882 -0.10823698 -0.15912937
## percamerindan 0.06453843 -0.127487786 0.11470316 0.04386861
## percasian -0.07144517 -0.017682248 0.59303988 -0.36123161
## percother -0.08942021 0.176666108 0.06711022 0.20749609
## popadults 0.01640764 0.005443480 -0.02055036 -0.02058595
## perchsd -0.14299416 0.021982405 -0.54219851 0.08183392
## percollege -0.18490572 -0.085836394 -0.15416795 0.21810997
## percprof -0.12699026 -0.132671989 -0.01548882 0.17960053
## poppovertyknown 0.01462687 0.008363739 -0.01921287 -0.01472497
## percpovertyknown -0.29522179 -0.137890254 0.02944298 -0.06807352
## percbelowpoverty 0.04715694 0.110133227 -0.14388516 -0.02624749
## percchildbelowpovert 0.03941614 0.196088069 -0.22772663 -0.22516992
## percadultpoverty 0.07247670 0.200967060 -0.19297958 -0.07394945
## percelderlypoverty -0.04984767 -0.397177629 0.20448090 0.66565555
## inmetro 0.81787338 -0.053569239 0.02223327 0.13178465
## PC13 PC14 PC15 PC16
## area 0.1492947689 0.049251461 -0.041581191 0.069182502
## poptotal 0.0809142843 0.094209562 -0.047035217 -0.019016740
## popdensity 0.1514018019 0.470174170 0.073725534 0.137872820
## popwhite 0.1025918501 0.227604890 0.155434954 0.125476506
## popblack 0.0891663464 -0.164520239 -0.647266067 -0.421509437
## popamerindian -0.3335273686 -0.557968501 0.174302448 0.088617583
## popasian -0.1235609452 -0.018281950 0.236492901 0.127410659
## popother -0.1072405390 -0.206522022 0.192288737 0.064863701
## percwhite 0.0161116446 0.005368624 0.006960275 -0.006191143
## percblack -0.0924593171 -0.140982560 0.064505197 0.054331701
## percamerindan 0.1254641220 0.143249174 -0.050450016 -0.034625963
## percasian -0.4171948840 -0.014102307 -0.173226241 -0.114670694
## percother 0.0638451377 0.053806551 -0.051329033 -0.007277167
## popadults 0.0847188440 0.111948754 -0.027655022 -0.007651811
## perchsd -0.5158899034 0.327909623 -0.181385662 0.230728906
## percollege 0.1647438288 0.017464859 0.418025355 -0.640037866
## percprof 0.4727125977 -0.293467485 -0.160774812 0.517382126
## poppovertyknown 0.0797510183 0.095501068 -0.047804340 -0.023710576
## percpovertyknown -0.0000828557 -0.130539767 -0.099251884 0.047437659
## percbelowpoverty -0.0261022587 0.042007626 -0.053588709 0.006592897
## percchildbelowpovert 0.0056653102 0.002253970 0.283931025 0.030674746
## percadultpoverty 0.0043048799 0.013080016 -0.225560922 -0.014777455
## percelderlypoverty -0.2378679877 0.237106801 -0.030680141 0.038169302
## inmetro -0.0385911067 0.011116430 -0.055901908 -0.025376419
## PC17 PC18 PC19 PC20
## area -0.080053175 -0.03173620 0.0086713439 0.0046807825
## poptotal 0.222846757 0.08703188 -0.0761931856 0.0230540916
## popdensity -0.574012367 -0.28222971 -0.0054579629 0.0117038685
## popwhite 0.466363870 0.07768886 -0.1407278107 0.0302968726
## popblack -0.255732749 0.19708095 0.1048230953 0.0086453550
## popamerindian 0.042838025 -0.03869656 0.0216649073 0.0046475411
## popasian -0.097050949 -0.03676084 0.7810868830 0.0340203731
## popother -0.315190248 -0.25802711 -0.5546364787 -0.0108903942
## percwhite -0.006903346 0.04869608 -0.0073086312 0.0155809463
## percblack 0.025791904 -0.12477880 0.0346139218 0.0044765970
## percamerindan -0.016219619 0.04163721 -0.0191989284 -0.0268487279
## percasian -0.051979291 0.10724876 -0.1273361926 -0.0157805258
## percother 0.027054066 0.04714364 0.0490008640 -0.0020609153
## popadults 0.225752151 0.08733846 -0.0768707766 -0.0917147487
## perchsd -0.001809384 0.09493895 -0.0150551778 0.0018112905
## percollege -0.002842201 -0.07361662 0.0195221284 0.0079930821
## percprof -0.080014048 0.13539216 -0.0053316027 -0.0068005627
## poppovertyknown 0.226822314 0.08897499 -0.0690617608 0.0035532809
## percpovertyknown 0.019687511 -0.10975064 0.0003994635 0.0008649907
## percbelowpoverty 0.046481431 -0.06957710 -0.0174105405 0.8250598558
## percchildbelowpovert -0.236560793 0.62012162 -0.0515270073 -0.2814040372
## percadultpoverty 0.216599873 -0.54943154 0.0903702718 -0.4610369991
## percelderlypoverty -0.023343777 0.11150431 -0.0369451797 -0.1215578702
## inmetro -0.041736731 -0.00287405 0.0326968965 0.0008015254
## PC21 PC22 PC23 PC24
## area 0.0006100954 8.367129e-04 9.258598e-11 -1.363506e-16
## poptotal 0.2623663997 -2.695286e-01 3.557253e-09 8.095608e-01
## popdensity 0.0067130253 3.678380e-03 2.870400e-10 -1.526557e-16
## popwhite 0.3418239111 -3.428497e-01 -9.776444e-09 -5.435525e-01
## popblack 0.1136442033 -1.399848e-01 -4.676845e-09 -2.143790e-01
## popamerindian -0.0078097728 -2.075393e-03 -1.201757e-10 -2.359217e-03
## popasian 0.0182710315 -1.946581e-02 -3.914513e-10 -2.584333e-02
## popother 0.0354980324 -2.634532e-02 7.874338e-11 -5.030128e-02
## percwhite -0.0010722219 -1.130704e-04 7.149193e-01 -6.579316e-09
## percblack -0.0013967824 3.417253e-04 5.180564e-01 -4.767611e-09
## percamerindan 0.0048244342 -2.217079e-04 4.575489e-01 -4.210767e-09
## percasian -0.0055630731 8.872089e-04 6.333343e-02 -5.828499e-10
## percother -0.0043170883 -6.030141e-04 8.453324e-02 -7.779492e-10
## popadults -0.8790319357 -1.030783e-01 -1.770633e-08 8.743006e-16
## perchsd 0.0021093628 -1.113264e-04 2.265936e-10 2.775558e-17
## percollege -0.0019738770 -1.664292e-03 -7.269330e-11 8.326673e-17
## percprof -0.0020900394 3.814056e-03 -1.656475e-10 -4.857226e-17
## poppovertyknown 0.1303632718 8.822909e-01 2.864555e-08 6.508682e-15
## percpovertyknown -0.0012238472 -2.699457e-03 -2.291530e-10 1.665335e-16
## percbelowpoverty -0.0830583366 4.745691e-03 4.430128e-09 -1.387779e-16
## percchildbelowpovert 0.0335261456 -7.310973e-04 -8.287914e-10 -1.526557e-16
## percadultpoverty 0.0416354842 -2.949434e-03 -2.967820e-09 4.718448e-16
## percelderlypoverty 0.0126368821 -1.055024e-03 -6.503165e-10 6.418477e-17
## inmetro -0.0026678800 -5.150479e-05 -9.243588e-11 -2.775558e-16
You can also get a summary of the components:
summary(out_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.0986 2.2096 1.6495 1.19289 1.12159 0.89776 0.8859
## Proportion of Variance 0.4001 0.2034 0.1134 0.05929 0.05241 0.03358 0.0327
## Cumulative Proportion 0.4001 0.6035 0.7168 0.77614 0.82856 0.86214 0.8948
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.81948 0.69212 0.5650 0.54394 0.48541 0.38000 0.35833
## Proportion of Variance 0.02798 0.01996 0.0133 0.01233 0.00982 0.00602 0.00535
## Cumulative Proportion 0.92283 0.94278 0.9561 0.96841 0.97823 0.98425 0.98960
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30948 0.25009 0.20879 0.19244 0.09654 0.03473 0.01328
## Proportion of Variance 0.00399 0.00261 0.00182 0.00154 0.00039 0.00005 0.00001
## Cumulative Proportion 0.99359 0.99619 0.99801 0.99955 0.99994 0.99999 1.00000
## PC22 PC23 PC24
## Standard deviation 0.003862 2.886e-09 5.193e-16
## Proportion of Variance 0.000000 0.000e+00 0.000e+00
## Cumulative Proportion 1.000000 1.000e+00 1.000e+00
There’s a broom method for PCA objects, so we can tidy the results. The tidy() function can summarize the output in various ways; we use the matrix argument to get tidy information on the principal components.
tidy_pca <- tidy(out_pca, matrix = "pcs")
tidy_pca
## # A tibble: 24 x 4
## PC std.dev percent cumulative
## <dbl> <dbl> <dbl> <dbl>
## 1 1 3.10 0.400 0.400
## 2 2 2.21 0.203 0.603
## 3 3 1.65 0.113 0.717
## 4 4 1.19 0.0593 0.776
## 5 5 1.12 0.0524 0.829
## 6 6 0.898 0.0336 0.862
## 7 7 0.886 0.0327 0.895
## 8 8 0.819 0.0280 0.923
## 9 9 0.692 0.0200 0.943
## 10 10 0.565 0.0133 0.956
## # … with 14 more rows
You can see from the percent and cumulative columns that the first principal component accounts for 40% of the variance. The second accounts for about 20% by itself and 60% cumulatively, and so on. By the fourth component we’re up to 77 percent accounted for. Note again that although it’s conventional to say that the components “explain” the variance in the variables, this is something that’s mathematically guaranteed by the way that the calculation happens. Whether this purely formal sense of “explanation” translates into something more substantively explanatory is a separate question.
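The percent column is just each component’s variance (its squared standard deviation) as a share of the total, which is why the accounting always works out. A sketch on made-up data (not midwest):

```r
set.seed(9)
## Made-up data: 60 observations, 5 scaled variables
dat <- matrix(rnorm(300), nrow = 60, ncol = 5)
pca <- prcomp(dat, center = TRUE, scale = TRUE)

## percent is each component's variance (sdev squared) over the total
percent <- pca$sdev^2 / sum(pca$sdev^2)
cumulative <- cumsum(percent)

## With scale = TRUE, the component variances always sum to the
## number of variables, so the shares are guaranteed to sum to 1
sum(pca$sdev^2)  # 5
sum(percent)     # 1
```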
Now we can make a “scree plot”, showing the relative importance of the components. Ideally we’d like the first four or so to account for almost all the variance:
tidy_pca %>%
ggplot(aes(x = PC, y = percent)) +
geom_line() +
labs(x = "Principal Component", y = "Variance Explained")
Not bad.
We can also project the original data points into the space defined by the components, using broom’s augment() function (which gives tidy observation-level output) rather than tidy().
aug_pca <- augment(out_pca, data = mw_pca[,-1])
aug_pca <- aug_pca %>%
tibble::add_column(midwest$state, midwest$county, .before = TRUE) %>%
rename(state = `midwest$state`, county = `midwest$county`)
ggplot(data = aug_pca,
mapping = aes(x = .fittedPC1,
y = .fittedPC2,
color = state)) +
geom_point()
Now let’s say that instead of doing a PCA on the whole dataset at once, we wanted to do it within each state. This is where our split-apply-combine approach comes in. First we take our mw_pca data and nest it within states:
mw_pca <- mw_pca %>%
group_by(state) %>%
nest()
mw_pca
## # A tibble: 5 x 2
## # Groups: state [5]
## state data
## <chr> <list>
## 1 IL <tibble [102 × 24]>
## 2 IN <tibble [92 × 24]>
## 3 MI <tibble [83 × 24]>
## 4 OH <tibble [88 × 24]>
## 5 WI <tibble [72 × 24]>
Now we can do the PCA by group (i.e. by state) and tidy the results:
state_pca <- mw_pca %>%
  mutate(pca = map(data, do_pca))
state_pca
## # A tibble: 5 x 3
## # Groups: state [5]
## state data pca
## <chr> <list> <list>
## 1 IL <tibble [102 × 24]> <prcomp>
## 2 IN <tibble [92 × 24]> <prcomp>
## 3 MI <tibble [83 × 24]> <prcomp>
## 4 OH <tibble [88 × 24]> <prcomp>
## 5 WI <tibble [72 × 24]> <prcomp>
This gives us a new list column, pca, each row of which is an object that contains all the results of running prcomp(). We can add a second list column with the tidied summary. Again we’ll write a helper function to make what we’re doing a little more legible.
do_tidy <- function(pr){
  broom::tidy(pr, matrix = "pcs")
}
state_pca <- mw_pca %>%
  mutate(pca = map(data, do_pca),
         pcs = map(pca, do_tidy))
state_pca
## # A tibble: 5 x 4
## # Groups: state [5]
## state data pca pcs
## <chr> <list> <list> <list>
## 1 IL <tibble [102 × 24]> <prcomp> <tibble [24 × 4]>
## 2 IN <tibble [92 × 24]> <prcomp> <tibble [24 × 4]>
## 3 MI <tibble [83 × 24]> <prcomp> <tibble [24 × 4]>
## 4 OH <tibble [88 × 24]> <prcomp> <tibble [24 × 4]>
## 5 WI <tibble [72 × 24]> <prcomp> <tibble [24 × 4]>
The pcs list column contains the tidied summary of the PCA. We can unnest it, and draw a graph like before, only with state-level grouping:
state_pca %>%
  unnest(cols = c(pcs)) %>%
  ggplot(aes(x = PC, y = percent)) +
  geom_line(size = 1.1) +
  facet_wrap(~ state, nrow = 1) +
  labs(x = "Principal Component",
       y = "Variance Explained")
We can also use the tools that broom gives us to see where the original data points (the counties) fall in the space created by the PCA. For this we’ll use broom’s augment() function. Augment returns tidy information at the level of the original observations (in this case, the counties in the midwest data). Again, a helper function for clarity.
do_aug <- function(pr){
  broom::augment(pr)
}
Let’s just recreate the whole object with the augmented data there as a third list column:
state_pca <- mw_pca %>%
  mutate(pca = map(data, do_pca),
         pcs = map(pca, do_tidy),
         fitted = map(pca, do_aug))
state_pca
## # A tibble: 5 x 5
## # Groups: state [5]
## state data pca pcs fitted
## <chr> <list> <list> <list> <list>
## 1 IL <tibble [102 × 24]> <prcomp> <tibble [24 × 4]> <tibble [102 × 25]>
## 2 IN <tibble [92 × 24]> <prcomp> <tibble [24 × 4]> <tibble [92 × 25]>
## 3 MI <tibble [83 × 24]> <prcomp> <tibble [24 × 4]> <tibble [83 × 25]>
## 4 OH <tibble [88 × 24]> <prcomp> <tibble [24 × 4]> <tibble [88 × 25]>
## 5 WI <tibble [72 × 24]> <prcomp> <tibble [24 × 4]> <tibble [72 × 25]>
You can see that the tibbles in the fitted list column all have 25 columns (the 24 numeric variables in midwest + their id), and a varying number of rows. The number of rows is the number of counties in that state.
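The same nest-map-augment pattern works on any grouped data. Here is a self-contained sketch using the built-in iris data (grouped by Species rather than state), just to show the shape of the result — the column counts differ from the midwest case, but the logic is identical:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Self-contained sketch of the nest/map/augment pattern: one prcomp
# fit per species, with the component scores augmented back per group.
fits <- iris %>%
  group_by(Species) %>%
  nest() %>%
  mutate(pca    = map(data, ~ prcomp(.x, scale. = TRUE)),
         fitted = map(pca, broom::augment))

# Each augmented tibble has one row per original observation in its group.
map_int(fits$fitted, nrow)
## [1] 50 50 50
```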
Now we plot the counties projected on to the first two principal components, faceted by state. We facet the graph because we ran the PCA separately for each state.
state_pca %>%
  unnest(cols = c(fitted)) %>%
  ggplot(aes(x = .fittedPC1,
             y = .fittedPC2)) +
  geom_point() +
  facet_wrap(~ state) +
  labs(x = "First Principal Component",
       y = "Second Principal Component")
It looks like the counties within each state cluster very strongly on the first component (the x-axis), with a couple of outlying counties in each case. Most of the visible spread is on the second component (the y-axis). Just for the sake of it, because now I’m curious about what those outlying counties are, let’s redo all the steps from the start, this time also holding on to the county names so we can use them in the plot. Here we go, all in one breath from the very beginning:
out <- midwest %>%
  group_by(state) %>%
  select_if(is.numeric) %>%
  select(-PID) %>%
  nest() %>%
  mutate(pca = map(data, do_pca),
         pcs = map(pca, do_tidy),
         fitted = map(pca, do_aug)) %>%
  unnest(cols = c(fitted)) %>%
  add_column(county = midwest$county)
ggplot(data = out, aes(x = .fittedPC1,
                       y = .fittedPC2,
                       label = county)) +
  geom_text(size = 1.9) +
  labs(x = "First Principal Component",
       y = "Second Principal Component") +
  theme_minimal() +
  facet_wrap(~ state, ncol = 1)
We can do that last add_column(county = midwest$county) step because we know we’ve ended with a tibble where the rows are the same entities (and in the same order) as the original midwest dataset.
You can see that in some cases the big outliers along the x-axis (the first component) are very highly populated counties. E.g. Cook County, IL, is the city of Chicago, Marion County, IN, is Indianapolis, and so on. Meanwhile, on the y-axis (the second component), looking at Illinois we can see DuPage County at one end, a well-to-do exurb of Chicago where Wheaton is located. And at the other end is Alexander County, the southernmost county in Illinois, with a relatively small population (about 8,000 people). Compare the characteristics of Alexander County and DuPage County to get a sense of why the PCA is putting them far apart from one another on the second component.
We can also look at the second and third components. Ideally the interpretation here is something like “Accounting for or excluding or apart from everything that the first component is picking up, how do counties vary or cluster on the next two orthogonal dimensions identified by the PCA?”
ggplot(data = out, aes(x = .fittedPC2,
                       y = .fittedPC3,
                       label = county)) +
  geom_text(size = 1.9) +
  labs(x = "Second Principal Component",
       y = "Third Principal Component") +
  theme_minimal() +
  facet_wrap(~ state, ncol = 1)
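When trying to interpret the second and third components, it also helps to look at the loadings — the rotation matrix that prcomp() returns — which tell you how heavily each original variable weighs on each component. A sketch, shown on the built-in USArrests data; for our fit you would pass out_pca (or one of the per-state fits) instead:

```r
library(dplyr)

# Sketch: inspect the loadings (rotation matrix) to help interpret a
# component. USArrests stands in for the midwest prcomp object.
pr <- prcomp(USArrests, scale. = TRUE)

# broom returns one row per variable-component pair; the variables with
# the largest absolute values are the ones that define that component.
loadings <- broom::tidy(pr, matrix = "rotation")

loadings %>%
  filter(PC %in% c(2, 3)) %>%
  arrange(PC, desc(abs(value)))
```

Substantively, this is where the researcher's judgment comes back in: a component is only as interpretable as the cluster of variables that load strongly on it.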